Authorship Identification for Heterogeneous Documents
Authors
Abstract
Authorship identification in Japanese has for the most part been studied only on literary texts, using basic statistical methods. In the present study, authors of mailing list messages are identified using a machine learning technique (Support Vector Machines). In addition, the classifier trained on the mailing list data is applied to Web documents, to investigate how well authorship identification performs on more heterogeneous documents. Experimental results show better identification performance when the features include not only conventional word N-gram information but also frequent sequential patterns extracted by a data mining technique (PrefixSpan).
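The two feature types named in the abstract can be illustrated with a minimal sketch. It assumes documents are already tokenized into word lists (Japanese text would first need a morphological analyzer, which is not shown), and it mines single-item sequential patterns only. The function names `word_ngrams`, `prefixspan`, and `contains_subsequence` are hypothetical, and this is not the authors' implementation; the abstract does not give the feature construction or the SVM settings.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Count contiguous word n-grams in one tokenized document."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prefixspan(sequences, min_support):
    """Mine frequent, possibly gapped, subsequences of single items.

    Returns a dict mapping each pattern (a tuple of items) to its support,
    i.e. the number of input sequences containing it as a subsequence.
    """
    results = {}

    def mine(prefix, projected):
        # projected holds (sequence index, start position) pairs: the part
        # of each sequence that follows the first match of the prefix.
        counts = Counter()
        for seq_id, start in projected:
            counts.update(set(sequences[seq_id][start:]))
        for item, support in sorted(counts.items()):
            if support < min_support:
                continue
            pattern = prefix + (item,)
            results[pattern] = support
            next_projected = []
            for seq_id, start in projected:
                try:
                    pos = sequences[seq_id].index(item, start)
                except ValueError:
                    continue  # item does not occur in this suffix
                next_projected.append((seq_id, pos + 1))
            mine(pattern, next_projected)

    mine((), [(i, 0) for i in range(len(sequences))])
    return results


def contains_subsequence(tokens, pattern):
    """True if the items of pattern occur in tokens in order, gaps allowed."""
    it = iter(tokens)
    return all(item in it for item in pattern)


# Toy corpus (assumed data, for illustration only).
docs = [["a", "b", "c"], ["a", "c"], ["b", "c"]]
patterns = prefixspan(docs, min_support=2)
bigrams = word_ngrams(docs[0], 2)
```

In a full pipeline, each document would be turned into a vector of n-gram counts plus binary indicators for the mined patterns, and those vectors would be passed to an SVM classifier.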
Similar resources
Local n-grams for Author Identification Notebook for PAN at CLEF 2013
Our approach to the author identification task uses existing authorship attribution methods using local n-grams (LNG) and performs a weighted ensemble. This approach came in third for this year’s competition, using a relatively simple scheme of weights by training set accuracy. LNG models create profiles, consisting of a list of character n-grams that best represent a particular author’s writin...
A Framework for Authorship Identification in the Internet Environment
Misuse of anonymous online communication for illegal purposes has become a major concern [2,12]. In this paper, we present a framework named ART (Authorship Recognition Tool), that is designed to minimize manual procedures and maximize the efficiency of authorship identification based on the content of Internet electronic documents. The framework covers the phases of document retrieval and data...
The Keyboard Dilemma and Authorship Identification
The keyboard dilemma is the problem of identifying the authorship of a document that was produced by a computer to which multiple users had access. This paper describes a systematic methodology for authorship identification. Validation testing of the methodology demonstrated 95% cross validated accuracy in identifying documents from ten authors and 85% cross validated accuracy in identifying fi...
Co-authorship network analysis and social network indicators of coronavirus research
Background and aim: The aim of this study was to examine the status of documents related to coronavirus based on scientometric indicators and to draw a co-authorship map of authors, organizations and countries producing an article to get to know this field as much as possible. Materials and methods: This applied-scientometric was conducted using social network analysis. The statistical populati...
Authorship Verification Using the Impostors Method Notebook for PAN at CLEF 2013
This paper describes the evaluation of the GenIM method, which participated in the PAN' 13 authorship identification competition. The approach is based on comparing the similarity between the given documents and a number of external (impostor) documents, so that documents can be classified as having been written by the same author, if they are shown to be more similar to each other than to the ...
Publication date: 2002